Abstract: This paper describes an experiment on the effect of insertions and deletions on the path
length of unbalanced binary search trees. Given a random binary tree, repeatedly inserting and
deleting nodes yields a tree that is no longer random. The expected internal path length differs
when different deletion algorithms are used. Previous empirical studies indicated that expected
internal path length tends to decrease after repeated insertions and asymmetric deletions. This
study shows that performing a larger number of insertions and asymmetric deletions actually
increases the expected internal path length, and that for sufficiently large trees, the expected
internal path length becomes worse than that of a random tree. However, with a symmetric
deletion algorithm, the results indicate that performing a large number of insertions and deletions
decreases the expected internal path length, and that the expected internal path length remains
better than that of a random tree.

This research was sponsored in part by the Office of Naval Research under contract N00014-70-C-0370.


1.  Introduction 


A binary tree created by inserting n randomly chosen keys into an empty tree has an expected
internal path length of $I_n \approx 1.386\, n \lg n$.* Randomly deleting k nodes from such a tree yields
a tree whose expected internal path length is $I_{n-k}$. Unfortunately, performing insertions after
deletions does not produce binary trees whose internal path length is predicted by this function.
A theoretical explanation of the effect of performing deletions and then insertions on binary trees
is still lacking [Knuth 73, Section 6.2.2].

This paper presents an empirical study on the effect of applying random insertions and
deletions to random binary search trees and analyzes results of experiments comparing asymmetric
and symmetric deletion algorithms. In a previous empirical study, Knott [Knott 75] suggests that
the expected internal path length tends to decrease after repeated insertions and asymmetric
deletions. In this study, the large number of insertions and asymmetric deletions performed
suggests that the expected internal path length first decreases but eventually begins to increase.
For sufficiently large trees, expected internal path length becomes worse than that of a random
tree. However, experiments using the symmetric deletion algorithm show that performing a large
number of insertions and symmetric deletions decreases the expected internal path length (making
the trees better than random).

Section 2 describes the insertion and deletion algorithms used in this study and provides an
overview of some of the previous work in this area. The statistics used in this study are defined
in Section 3. Section 3 also mentions a few specifics about how the data was gathered. The
observations in Section 4 give an interpretation of the data and the conclusions are summarized in
Section 5.


2.  Background 

Insertion Algorithm: The structure of binary trees naturally leads to one insertion algorithm. To
insert a node into a binary tree (known not to contain the node), compare the new and current
keys and insert the node into the left or right subtree, whichever maintains the invariant of the
data structure. The Pascal code for this algorithm is provided in Figure 1, below. For further
explanation see [Knuth 73, Section 6.2.2, Algorithm T].

* Throughout this paper, lg n denotes $\log_2 n$.

PROCEDURE Insert(VAR root : NodePtr; x : DataType);
BEGIN
  IF root = NIL
    THEN BEGIN
      NEW(root); root^.data := x;
      root^.lChild := NIL; root^.rChild := NIL
    END
  ELSE IF x < root^.data
    THEN Insert(root^.lChild, x)
    ELSE Insert(root^.rChild, x)
END;

Figure 1: The insertion procedure.

Unlike  insertion,  there  are  many  reasonable  deletion  algorithms  from  which  to  choose.  This 
paper  describes  experiments  with  Knuth’s  asymmetric  deletion  algorithm  and  a  trivially  modified 
version  of  this  algorithm  to  make  it  symmetric. 

Asymmetric Deletion Algorithm: A node's successor is defined to be the smallest node in the right
subtree. Similarly a node's predecessor is defined to be the largest node in the left subtree. To
delete a node from a binary tree, replace the node with its successor, i.e., the node that contains
the next larger key. The Pascal code for this algorithm is given in Figure 2, below. Figure 4 shows
examples of the insertion algorithm and this deletion algorithm applied to a particular binary tree;
for further explanation see [Knuth 73, Section 6.2.2, Algorithm D].

Symmetric Deletion Algorithm: To delete a node from a binary tree, replace the node with its
successor or predecessor. Alternately choose the successor and predecessor (so that half the time
the RightDelete routine is called and half the time a suitably modified version of this routine,
LeftDelete, is called).
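
The alternation between successor and predecessor can be sketched as follows. This is a minimal illustration in Python rather than the paper's Pascal, and it is not a transcription of the author's LeftDelete routine. The names `Node`, `insert`, `delete`, `SymmetricDeleter`, and `inorder` are hypothetical. The sketch assumes that the predecessor variant is the exact mirror image of Cases I to III of RightDelete.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: "Optional[Node]" = None
    right: "Optional[Node]" = None


def insert(root, key):
    """Insert key (assumed absent) and return the root of the subtree."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root


def delete(root, key, use_successor=True):
    """Delete key (assumed present), replacing it by its successor or predecessor."""
    # 'near' is the side the replacement comes from; 'far' is the other side.
    near, far = ("right", "left") if use_successor else ("left", "right")
    if key != root.key:
        side = "left" if key < root.key else "right"
        setattr(root, side, delete(getattr(root, side), key, use_successor))
        return root
    if getattr(root, near) is None:          # Case I: nothing on the near side
        return getattr(root, far)
    parent, repl = root, getattr(root, near)
    while getattr(repl, far) is not None:    # walk to the extreme node
        parent, repl = repl, getattr(repl, far)
    if parent is not root:                   # Case III: unlink the replacement
        setattr(parent, far, getattr(repl, near))
        setattr(repl, near, getattr(root, near))
    setattr(repl, far, getattr(root, far))   # Cases II and III: adopt far subtree
    return repl


class SymmetricDeleter:
    """Keeps the one bit of state needed to alternate the two variants."""

    def __init__(self):
        self.use_successor = True

    def delete(self, root, key):
        root = delete(root, key, self.use_successor)
        self.use_successor = not self.use_successor
        return root


def inorder(root):
    return [] if root is None else inorder(root.left) + [root.key] + inorder(root.right)


root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, k)
deleter = SymmetricDeleter()
root = deleter.delete(root, 50)   # first deletion uses the successor (60)
root = deleter.delete(root, 60)   # second deletion uses the predecessor (40)
print(root.key, inorder(root))
```

Passing `use_successor=True` on every call gives the asymmetric behaviour; the only difference between the two algorithms in this sketch is whether the flag is toggled.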

PROCEDURE RightDelete(VAR root : NodePtr; x : DataType);
VAR copy, successor, succPtr : NodePtr;
BEGIN
  IF x < root^.data
    THEN RightDelete(root^.lChild, x)
  ELSE IF x > root^.data
    THEN RightDelete(root^.rChild, x)
  ELSE BEGIN
    copy := root;
    IF root^.rChild = NIL
      { Case I: There is no successor. }
      THEN root := root^.lChild
    ELSE IF root^.rChild^.lChild = NIL
      { Case II: The successor is the right child. }
      THEN BEGIN
        root^.rChild^.lChild := root^.lChild;
        root := root^.rChild
      END
    { Case III: The successor is the leftmost child in the right subtree. }
    ELSE BEGIN
      succPtr := root^.rChild;
      WHILE succPtr^.lChild^.lChild <> NIL DO
        succPtr := succPtr^.lChild;
      successor := succPtr^.lChild;
      succPtr^.lChild := successor^.rChild;
      successor^.lChild := root^.lChild;
      successor^.rChild := root^.rChild;
      root := successor
    END;
    DISPOSE(copy)
  END
END;

Figure 2: The asymmetric deletion procedure.

Consider building a binary tree using n keys chosen randomly from a uniform distribution
(i.e., all n! permutations of the keys are equally likely). There are $\binom{2n}{n}/(n+1)$ possible shapes for
this tree [Knuth 68, Section 2.3.4.4], each with some probability of occurring; call the distribution
$D_n$. By this definition, inserting a new node into this binary tree would yield a tree of size n + 1
whose shape occurs with a probability defined by $D_{n+1}$. Binary trees whose distribution of shapes
is $D_n$ are called random binary trees.
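
For small n the distribution of shapes of a random binary tree can be enumerated directly. The following Python sketch is an illustration only and is not part of the original study; the helper names `shape` and `shape_distribution` are hypothetical. It represents a shape as nested `(left, right)` tuples and counts how many of the n! insertion orders produce each shape.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations
from math import comb, factorial


def shape(keys):
    """Shape of the tree built by inserting keys in order, as nested tuples."""
    if not keys:
        return None
    root, rest = keys[0], keys[1:]
    left = tuple(k for k in rest if k < root)
    right = tuple(k for k in rest if k > root)
    return (shape(left), shape(right))


def shape_distribution(n):
    """Probability of each shape when all n! insertion orders are equally likely."""
    counts = Counter(shape(p) for p in permutations(range(n)))
    return {s: Fraction(c, factorial(n)) for s, c in counts.items()}


d3 = shape_distribution(3)
print(len(d3), comb(6, 3) // 4)   # prints "5 5": five shapes for n = 3
for s, p in d3.items():
    print(p, s)
```

For n = 3 the balanced shape has probability 1/3 and each of the four chain-like shapes has probability 1/6, so the shapes are not equally likely even though the insertion orders are.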

Thomas Hibbard [Hibbard 62] proved that deleting a random node (i.e., where each node has
an equal probability of being deleted) from a binary tree of size n, with distribution of shapes $D_n$,
yields a tree with a distribution of shapes $D_{n-1}$.

Strangely, performing random insertion and deletion operations on a random tree does not
preserve this distribution of shapes. Consider building a binary tree of size n, as described above.
Since the keys are chosen from a uniform distribution, the probability of inserting a new node in
any particular inter-key gap is $1/(n+1)$. After one random deletion, the distribution of shapes will be
$D_{n-1}$, but the probability of inserting a new node where the deleted node used to be will be $2/(n+1)$
(while all other places are still $1/(n+1)$). Knuth [Knuth 73, Section 6.2.2] describes this phenomenon
as follows:

The shape of the tree is random after deletions, but the relative
distribution of values in a given tree shape may change, and it
turns out that the first random insertion after a deletion actually
destroys the randomness property on shapes. This startling fact,
first observed by Gary Knott in 1972, must be seen to be believed.

Empirical evidence suggests strongly that the path length tends to
decrease after repeated deletions and insertions, so the departure
from randomness seems to be in the right direction; a theoretical
explanation for this behavior is still lacking.

Knuth feels that binary trees tend to improve because "path length tends to decrease." One
way to compare binary trees is to measure their internal path lengths. The internal path length
(IPL) of a tree is defined as the sum of the depths of the nodes in the tree,

$$\sum_{i} \mathrm{distance}(\mathit{root}, i).$$

For a random tree containing n nodes, the expected IPL is denoted as $I_n$ and the expected number
of comparisons in a successful search is denoted as $C_n$. Knuth [Knuth 73, Section 6.2.2] gives the
expected number of comparisons in a successful search, $C_n$, as approximately equal to $1.386 \lg n$.
Substituting into the relation $I_n = n(C_n - 1)$, one obtains the approximation $I_n \approx 1.386\, n \lg n$.
A distribution of trees is said to be "better than random" when the expected IPL is less than $I_n$
(since the expected number of comparisons is proportional to the IPL).
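
The definitions above can be made concrete with a short Python sketch; it is illustrative only and not taken from the paper. The function names are hypothetical, and the recurrence used for the exact value of $I_n$ (the root is equally likely to have any rank, and every non-root node gains one level of depth) is the standard one for random binary search trees rather than something stated in the text.

```python
from fractions import Fraction
from math import log2


def internal_path_length(tree, depth=0):
    """Sum of node depths; a tree is None or a (left, right) pair, root at depth 0."""
    if tree is None:
        return 0
    left, right = tree
    return (depth
            + internal_path_length(left, depth + 1)
            + internal_path_length(right, depth + 1))


def expected_ipl(n):
    """Exact I_n from I_m = (m - 1) + (2/m) * (I_0 + ... + I_{m-1}), with I_0 = 0."""
    values = [Fraction(0)]
    running_sum = Fraction(0)          # I_0 + ... + I_{m-1}
    for m in range(1, n + 1):
        values.append(m - 1 + 2 * running_sum / m)
        running_sum += values[m]
    return values[n]


for n in (64, 512, 2048):
    print(n, float(expected_ipl(n)), 1.386 * n * log2(n))
```

The two printed columns agree only in their leading term: the exact value also contains a negative term linear in n, so for the tree sizes considered here it is noticeably smaller than $1.386\, n \lg n$.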

3.  Methodology

If a random sequence of insertions and deletions were applied to a random tree of size n, the
resulting tree would probably not have the same number of nodes. The original tree's IPL would
therefore not be directly comparable with the IPL of the new tree. In this study, sequences of
insertion/deletion pairs (I/D pairs) are applied to random trees. Since the resulting tree always
has the same size, it is easy to see whether any improvement has been made. (Knott's data was
also obtained by using I/D pairs.) The first step of the simulation is therefore to insert n nodes
into an empty tree, after which successive pairs of insertions followed by deletions are performed.

Let $\overline{IPL}_{n,i}$ denote the measured mean IPL of an n-node binary tree after applying i I/D pairs.
Figures 5 through 10 show $\overline{IPL}_{n,i}/I_n$ plotted as a function of i. This ratio shows the improvement
of the resulting tree's expected IPL as a fraction of the random tree's expected IPL.

The deletion algorithm given above generally replaces the node to be deleted with its successor,
the "leftmost node in the right subtree". The left and right subtrees are treated differently and,
as observed below, this appears to have a profound effect on the behavior of binary trees. Such a
deletion algorithm is called an asymmetric deletion algorithm. The symmetric deletion algorithm
which is examined in this study is a trivially modified version of the asymmetric algorithm. This
symmetric algorithm alternately replaces the node to be deleted with its successor or its predecessor.
The algorithm requires a small amount of state information, but similar results have been obtained
by randomly replacing the node to be deleted by its successor or predecessor.

To ensure that the results were not an artifact of the random number generator, simulations
were performed on both DEC-20s and Perqs. In the DEC-20 simulations the random number
generator used the linear congruential method to produce 36-bit pseudorandom numbers [Knuth
69, Section 3.2]. The random number generator for the Perqs is the feedback shift-register
pseudorandom number generator as described in [Lewis 73]. The data presented in this paper
was generated on the Perqs and took about one month of CPU time, but similar results were
obtained for the smaller trees on the DEC-20s.

The outer loop of the simulation program is very simple. First, build a tree with tsize nodes,
then gather data before and after each interval of isize I/D pairs.

FOR i := 1 TO tsize DO RndInsert;
... gather data ...
FOR i := 1 TO intervals DO BEGIN
  FOR j := 1 TO isize DO BEGIN RndInsert; RndDelete END;
  ... gather data ...
END;
FreeTree;

Figure 3: The inner loop of a simulation.
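
A compact, runnable version of the loop in Figure 3 might look like the following Python sketch. It is an illustration under stated assumptions, not the program used in the study: keys are drawn with Python's `random` module instead of the generators described above, the node to delete is chosen uniformly from a side list of keys, and the names `simulate`, `insert`, `delete`, and `ipl` are hypothetical. The parameters `tsize`, `intervals`, and `isize` follow Figure 3.

```python
import random


class Node:
    __slots__ = ("key", "left", "right")

    def __init__(self, key):
        self.key, self.left, self.right = key, None, None


def insert(root, key):
    if root is None:
        return Node(key)
    node = root
    while True:
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return root
        node = child


def delete(root, key, use_successor):
    near, far = ("right", "left") if use_successor else ("left", "right")
    parent, side, node = None, None, root
    while node.key != key:                        # locate the node (assumed present)
        parent, side = node, ("left" if key < node.key else "right")
        node = getattr(node, side)
    repl = getattr(node, near)
    if repl is None:                              # Case I
        repl = getattr(node, far)
    else:
        rparent = node
        while getattr(repl, far) is not None:
            rparent, repl = repl, getattr(repl, far)
        if rparent is not node:                   # Case III
            setattr(rparent, far, getattr(repl, near))
            setattr(repl, near, getattr(node, near))
        setattr(repl, far, getattr(node, far))    # Cases II and III
    if parent is None:
        return repl
    setattr(parent, side, repl)
    return root


def ipl(root):
    total, stack = 0, [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node is not None:
            total += depth
            stack += [(node.left, depth + 1), (node.right, depth + 1)]
    return total


def simulate(tsize, intervals, isize, symmetric, seed=0):
    """Return the IPL measured before and after each interval of isize I/D pairs."""
    rng = random.Random(seed)
    root, keys, use_successor = None, [], True
    for _ in range(tsize):                        # build the initial random tree
        keys.append(rng.random())
        root = insert(root, keys[-1])
    data = [ipl(root)]
    for _ in range(intervals):
        for _ in range(isize):
            keys.append(rng.random())             # random insertion
            root = insert(root, keys[-1])
            j = rng.randrange(len(keys))          # random deletion
            keys[j], keys[-1] = keys[-1], keys[j]
            root = delete(root, keys.pop(), use_successor)
            if symmetric:
                use_successor = not use_successor
        data.append(ipl(root))
    return data


print(simulate(64, 5, 100, symmetric=False))
print(simulate(64, 5, 100, symmetric=True))
```

A single run like this is far too small to reproduce the trends discussed in Section 4; the study averages thousands of samples per tree size and applies on the order of $n^2$ I/D pairs.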

4.  Observations 

The graphs in Figures 5 and 6 show the expected internal path length of n-node binary
trees plotted against the number of insertion and asymmetric deletion pairs. Initially, $\overline{IPL}_{n,i}$
decreases, as Knott and Knuth observed. After some critical point, though, $\overline{IPL}_{n,i}$ starts to
increase, eventually levelling off after approximately $n^2$ I/D pairs. Figure 7 is a comparison chart
in which $\overline{IPL}_{n,i}/I_n$ is plotted as a function of $i/n^2$ for each of the values of n tested. (The latter
ratio normalizes the x-axis.)

Perhaps the most significant observation is that as n increases so does the asymptotic value
for $\overline{IPL}_{n,i}/I_n$. Since binary trees can be modeled by Markov chains, and any binary tree may be
obtained by applying some combination of I/D pairs to any other binary tree, the $\lim_{i \to \infty} IPL_{n,i}$
exists [Ross 70, Theorem 4.9]. Figure 7 suggests that

$$\lim_{i \to \infty} \overline{IPL}_{n,i} > I_n$$

for sufficiently large values of n (roughly greater than 128). Thus binary trees seem to become
"worse than random" after many insertions and deletions.

The comparison chart in Figure 11 shows the asymptotic values of $\overline{IPL}_{n,i}/I_n$ for both deletion
algorithms plotted against n (on a log scale). The data given in Table 1 was obtained by summing
all the $IPL_{n,i}$ and $IPL_{n,i}^2$ when $i > n^2$.


   n     Samples    $\overline{IPL}_{n,i}/I_n$    Variance
   64     6000              0.97               0.01652
  128     6800              1.00               0.01340
  256     2300              1.06               0.00985
  512     1200              1.16               0.00970
 1024      750              1.30               0.01013
 2048     5340              1.49               0.00771

Table 1: Data for Asymmetric Deletions.

The asymmetric curve appears to be quadratic. A least-squares multiple regression weighted by
the inverse of the variance yields the following approximation:

$$\lim_{i \to \infty} \frac{\overline{IPL}_{n,i}}{I_n} \approx 0.0202 \lg^2 n - 0.241 \lg n + 1.69.$$

Substituting $I_n \approx 1.386\, n \lg n$ we obtain

$$\lim_{i \to \infty} \overline{IPL}_{n,i} \approx 0.0280\, n \lg^3 n - 0.334\, n \lg^2 n + 2.34\, n \lg n.$$



The graphs in Figures 8 and 9 show the corresponding plots of the data in Table 2 for the
expected internal path length for symmetric deletions.


   n     Samples    $\overline{IPL}_{n,i}/I_n$    Variance
   64     6000              0.905              0.01654
  128     6800              0.890              0.00916
  256     2300              0.888              0.00615
  512     1200              0.890              0.00347
 1024      750              0.881              0.00235
 2048     5340              0.883              0.00269

Table 2: Data for Symmetric Deletions.

$\overline{IPL}_{n,i}$ decreases initially, as in the case of asymmetric deletions, but the asymptotic value
of the expected internal path length seems to remain lower than that of a random tree. The
comparison charts in Figures 10 and 11 indicate that

$$1 > \lim_{i \to \infty} \frac{\overline{IPL}_{n,i}}{I_n} \approx 0.88$$

or that

$$I_n > \lim_{i \to \infty} \overline{IPL}_{n,i} \approx 1.22\, n \lg n.$$

The comparison chart in Figure 11 shows the asymptotic value of $\overline{IPL}_{n,i}/I_n$ slowly decreasing as n
increases. Since a binary tree with n nodes cannot have an internal path length less than that of
a perfect tree, we know that

$$\lim_{i \to \infty} \overline{IPL}_{n,i} = \Omega(n \log n).$$
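
The lower bound can be checked numerically. The sketch below, in Python and purely illustrative, computes the smallest internal path length an n-node binary tree can have and prints it next to the symmetric-deletion estimate $1.22\, n \lg n$ and the random-tree estimate $1.386\, n \lg n$. The function name `min_ipl` is hypothetical; it relies on the fact that in a perfectly balanced tree the k-th node in level order sits at depth $\lfloor \lg k \rfloor$.

```python
from math import log2


def min_ipl(n):
    """Smallest possible internal path length of an n-node binary tree."""
    # The k-th node in level order (k = 1..n) has depth floor(lg k),
    # which for a positive integer k equals k.bit_length() - 1.
    return sum(k.bit_length() - 1 for k in range(1, n + 1))


for n in (64, 128, 256, 512, 1024, 2048):
    print(n, min_ipl(n), round(1.22 * n * log2(n)), round(1.386 * n * log2(n)))
```

For every tree size in Table 2 the minimum lies below the symmetric-deletion estimate, which in turn lies below the random-tree estimate.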


5.  Conclusions

The expected internal path length of a random binary tree is $I_n = \Theta(n \log n)$. Empirical
evidence suggests that performing many insertions and asymmetric deletions yields binary trees
with an expected internal path length of $\overline{IPL}_{n,i} = \Theta(n \log^3 n)$. Thus performing asymmetric
deletions causes binary trees to become more unbalanced. Amazingly, the expected path length
does not increase by a constant factor, but rather by a factor of $\log^2 n$. However, experiments show
that the symmetric deletion algorithm improves the balance of binary trees, leaving the expected
internal path length $\Theta(n \log n)$, but with a smaller constant coefficient than the expected internal
path length of a random binary tree.

Because this is an empirical study, the above conclusions can only be conjectures. No one has
provided a theoretical explanation of the behavior of a binary tree's path length after applying
deletions and then insertions. There is no proof that the asymptotic value of $\overline{IPL}_{n,i}$ is less than
$I_n$ when performing random insertions and symmetric deletions or that the asymptotic value of
$\overline{IPL}_{n,i}$ is greater than $I_n$ when applying insertions and asymmetric deletions.

In closing, it should be noted that the results of this study will have little impact on the use
of binary trees in practice. It takes approximately 1.5 million random insertions and asymmetric
deletions to make a 2048-node binary tree worse than a random tree, and 4 million before its
expected internal path length reaches the asymptotic value (which is just 50% worse). When so
many operations are required, other data structures are probably more appropriate.

Figure 4: Examples of Insertion and Asymmetric Deletion. (Panels: delete 43; delete 54; delete 19, case III.)

Figure 11